[SPARK-34819][SQL] MapType supports orderable semantics #31967

WangGuangxin · 2021-03-26T05:40:35Z

What changes were proposed in this pull request?

Currently, MapType doesn't support orderable semantics, although Hive and Presto do support them. This makes it hard to migrate from Hive to Spark SQL when users GROUP BY or ORDER BY a map-typed column in their SQL.

Why are the changes needed?

Generally, we compare two maps by the following steps:

1. If the sizes of the two maps are not equal, compare them by size.
2. Otherwise, sort the entries of each map by key, then compare the entries of the two maps pairwise, first by key and then by value.
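
For illustration, the two steps above could be sketched in plain Scala as follows. This is a minimal sketch on ordinary Scala collections, not the code in this PR: the object `MapOrderingSketch` and the helper `compareMaps` are hypothetical names, and keys and values are assumed to have an implicit `Ordering`.

```scala
object MapOrderingSketch {
  // Hypothetical helper (not Spark's API): compares two maps by size first,
  // then by their key-sorted entries, key before value.
  def compareMaps[K, V](a: Map[K, V], b: Map[K, V])(
      implicit kOrd: Ordering[K], vOrd: Ordering[V]): Int = {
    if (a.size != b.size) {
      // Step 1: maps of different sizes are ordered by size.
      Integer.compare(a.size, b.size)
    } else {
      // Step 2: sort each map's entries by key, then compare pairwise.
      val left = a.toSeq.sortBy(_._1)
      val right = b.toSeq.sortBy(_._1)
      left.zip(right).iterator.map { case ((k1, v1), (k2, v2)) =>
        val c = kOrd.compare(k1, k2)
        if (c != 0) c else vOrd.compare(v1, v2)
      }.find(_ != 0).getOrElse(0) // all entries equal: the maps are equal
    }
  }
}
```

With this sketch, two maps holding the same entries compare as equal regardless of insertion order, because both sides are sorted by key before the pairwise comparison.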

We have to handle this specially in grouping/join/window because Spark SQL turns grouping/join/window partition keys into binary UnsafeRow and compares the binary data directly instead of using MapType's ordering. In these cases, we have to insert a SortMapKey expression to sort the map entries by key. This is very similar to NormalizeFloatingNumbers.
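
To illustrate why a byte-wise comparison needs the entries sorted first, here is a simplified sketch. It does not use UnsafeRow; the `encode` function, the `Entries` alias, and the object name are assumptions made up for this example, and the byte layout is deliberately simplistic.

```scala
import java.nio.ByteBuffer

object BinaryCompareSketch {
  // A map stored as entries in insertion order. This is a simplification
  // for illustration only, not the actual UnsafeRow layout.
  type Entries = Seq[(Int, Int)]

  // Hypothetical encoder: writes each key and value as 4 bytes, in entry order.
  def encode(entries: Entries): Seq[Byte] = {
    val buf = ByteBuffer.allocate(entries.size * 8)
    entries.foreach { case (k, v) => buf.putInt(k).putInt(v) }
    buf.array().toSeq
  }

  // Normalization step: sort the entries by key before encoding.
  def sortByKey(entries: Entries): Entries = entries.sortBy(_._1)

  // Two logically equal maps whose entries were inserted in different orders.
  val m1: Entries = Seq(1 -> 10, 2 -> 20)
  val m2: Entries = Seq(2 -> 20, 1 -> 10)

  // Comparing the raw bytes treats the two maps as different ...
  val rawEqual: Boolean = encode(m1) == encode(m2)
  // ... while sorting by key first makes the encodings identical.
  val normalizedEqual: Boolean = encode(sortByKey(m1)) == encode(sortByKey(m2))
}
```

In this sketch `rawEqual` is false and `normalizedEqual` is true, which is the reason a sorting expression has to be applied before the keys are turned into binary form.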

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added more unit tests.

WangGuangxin · 2021-03-26T05:53:29Z

@hvanhovell @cloud-fan @maropu Could you please help review this?

c21

Is this a re-proposal of https://issues.apache.org/jira/browse/SPARK-18134 ? I thought the decision made on that JIRA was that we won't support this. But we also encountered this when migrating from Hive to Spark. We worked around it by adding a logical plan rule to convert the map to a sorted array if needed.

hvanhovell · 2021-03-26T06:07:53Z

@c21 we should support this. I just ran out of time when I was working on it.

hvanhovell · 2021-03-26T08:53:35Z

Ok to test

SparkQA · 2021-03-26T10:57:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41140/

SparkQA · 2021-03-26T11:09:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41140/

SparkQA · 2021-03-26T11:24:41Z

Test build #136559 has finished for PR 31967 at commit f60a5c4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-26T12:01:11Z

Test build #136556 has finished for PR 31967 at commit a08356a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-26T12:41:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41143/

SparkQA · 2021-03-26T12:51:00Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41143/

SparkQA · 2021-03-28T13:11:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41196/

SparkQA · 2021-03-28T13:18:30Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41196/

SparkQA · 2021-03-28T16:54:05Z

Test build #136614 has finished for PR 31967 at commit 58bb3cc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-03-29T02:34:08Z

Links to the previous PRs: #15970 and #19330

maropu · 2021-03-30T06:39:49Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

@@ -687,6 +688,118 @@ class CodegenContext extends Logging {
          }
        """
      s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
+    case _ @ MapType(keyType, valueType, valueContainsNull) =>


What's the difference from @hvanhovell's implementation? That one looks simpler, though.
https://github.com/apache/spark/pull/15970/files#diff-1501206e78d34b65183af1092c8ec392ce18574bb538f905ca93a22983c63ae6R558-R598

Btw, can't we reuse the Array case? https://github.com/apache/spark/pull/31967/files#diff-1501206e78d34b65183af1092c8ec392ce18574bb538f905ca93a22983c63ae6R643

maropu · 2021-03-30T06:40:56Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala

@@ -141,6 +139,27 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
      val function = normalize(lv)
      KnownFloatingPointNormalized(ArrayTransform(expr, LambdaFunction(function, Seq(lv))))

+    case _ if expr.dataType.isInstanceOf[MapType] =>
+      val MapType(kt, vt, containsNull) = expr.dataType
+      var normalized = if (needNormalize(kt)) {


Could you avoid using var here?

Could you add tests for this new code path in NormalizeFloatingPointNumbersSuite?

maropu · 2021-03-30T06:43:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala

+ */
+object NormalizeMapType extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case w: Window if w.partitionSpec.exists(p => needNormalize(p)) =>


Doesn't this rule need to support the BinaryComparison cases?

maropu · 2021-03-30T06:44:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala

+
+  override def nullSafeEval(input: Any): Any = {
+    val childMap = input.asInstanceOf[MapData]
+    val keys = childMap.keyArray()


Don't we need to sort the data recursively for nested cases like map<map<int,int>,string> and map<struct<a: map<int,int>>,string>?

WangGuangxin (Apr 2, 2021): It seems that I missed this case. I'll fix it.
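
As a rough illustration of the recursive sorting discussed above, the sketch below normalizes nested maps bottom-up on a toy value model. Everything here (`Value`, `IntV`, `StrV`, `StructV`, `MapV`, `normalize`, `compare`) is hypothetical and unrelated to Catalyst's actual classes; it only shows the idea that maps nested inside keys, values, and struct fields have to be sorted before the enclosing map is.

```scala
object NestedMapSortSketch {
  // Toy value model, an assumption for illustration (not Catalyst types).
  sealed trait Value
  final case class IntV(i: Int) extends Value
  final case class StrV(s: String) extends Value
  final case class StructV(fields: Seq[Value]) extends Value
  final case class MapV(entries: Seq[(Value, Value)]) extends Value

  // Compares two already-normalized values of the same shape.
  def compare(a: Value, b: Value): Int = (a, b) match {
    case (IntV(x), IntV(y)) => Integer.compare(x, y)
    case (StrV(x), StrV(y)) => x.compareTo(y)
    case (StructV(xs), StructV(ys)) => compareSeq(xs, ys)
    case (MapV(xs), MapV(ys)) =>
      if (xs.size != ys.size) Integer.compare(xs.size, ys.size)
      else compareSeq(
        xs.flatMap { case (k, v) => Seq(k, v) },
        ys.flatMap { case (k, v) => Seq(k, v) })
    case _ => throw new IllegalArgumentException("mismatched value types")
  }

  // Element-wise comparison; falls back to length when all pairs are equal.
  private def compareSeq(xs: Seq[Value], ys: Seq[Value]): Int =
    xs.zip(ys).iterator.map { case (x, y) => compare(x, y) }
      .find(_ != 0).getOrElse(Integer.compare(xs.size, ys.size))

  // Normalizes bottom-up: nested maps inside keys, values, and struct fields
  // are sorted before the enclosing map's entries are sorted by key.
  def normalize(v: Value): Value = v match {
    case StructV(fields) => StructV(fields.map(normalize))
    case MapV(entries) =>
      val inner = entries.map { case (k, x) => (normalize(k), normalize(x)) }
      MapV(inner.sortWith((l, r) => compare(l._1, r._1) < 0))
    case other => other
  }
}
```

Sorting only the outermost map would leave a key such as a map<int,int> in its insertion order, so two logically equal values of type map<map<int,int>,string> could still differ after normalization; recursing into keys first avoids that.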

maropu · 2021-03-30T06:45:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala

+  }
+}
+
+case class SortMapKey(child: Expression) extends UnaryExpression with ExpectsInputTypes {


SortMapKey -> SortMapKeys?

maropu · 2021-03-30T06:47:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala

+    mapBuilder.build()
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {


To make this PR simpler, how about leaving the codegen support to follow-up PRs, just like the original PR? https://github.com/apache/spark/pull/15970/files#diff-da163d97a5f0fc534aad719c4a39eca97116f25bfc05b7d8941b342a3ed96036R423-R429

maropu · 2021-03-30T06:47:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    Batch("ReplaceUpdateFieldsExpression", Once, ReplaceUpdateFieldsExpression)

+


nit: unnecessary change.

maropu · 2021-03-30T06:48:33Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapTypeSuite.scala

+  }
+}
+
+


nit: remove unnecessary blank lines.

maropu · 2021-03-30T06:52:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala

+ * to insert an expression to sort map entries by key.
+ *
+ * Note that, this rule must be executed at the end of optimizer, because the optimizer may create
+ * new joins(the subquery rewrite) and new join conditions(the join reorder).


Could you leave some comments about why this rule does not handle the Aggregate cases? https://github.com/apache/spark/pull/31967/files#diff-21f071d73070b8257ad76e6e16ec5ed38a13d1278fe94bd42546c258a69f4410R344

maropu · 2021-03-30T07:00:28Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+
+  test("SPARK-34819: MapType has nesting complex type supports orderable semantics") {
+    Seq(CodegenObjectFactoryMode.CODEGEN_ONLY.toString,
+      CodegenObjectFactoryMode.NO_CODEGEN.toString).foreach {


Could you move the two tests into SQLQueryTestSuite? You can use the CONFIG_DIM directive there:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/join.sql#L18-L20

maropu · 2021-04-26T08:06:13Z

Any update?

maropu · 2021-05-13T03:40:26Z

@WangGuangxin If you cannot keep working on it, is it okay that I take this over?

WangGuangxin · 2021-05-13T06:10:47Z

> @WangGuangxin If you cannot keep working on it, is it okay that I take this over?

Sure, I'm stuck with something else; you can take this over if you have time. Thanks.

github-actions · 2021-08-22T00:08:54Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

WangGuangxin added 3 commits March 26, 2021 10:47

MapType supports comparable/orderable semantics

7d16098

update

ac712ce

update

e9366fe

github-actions bot added the SQL label Mar 26, 2021

c21 reviewed Mar 26, 2021

fix style

a08356a

fix ut

58bb3cc

WangGuangxin force-pushed the map_type_orderable_4 branch from f60a5c4 to 58bb3cc Compare March 28, 2021 12:28

maropu changed the title ~~[SPAKR-34819][SQL]MapType supports orderable semantics~~ [SPARK-34819][SQL]MapType supports orderable semantics Mar 29, 2021

maropu changed the title ~~[SPARK-34819][SQL]MapType supports orderable semantics~~ [SPARK-34819][SQL] MapType supports orderable semantics Mar 30, 2021

maropu reviewed Mar 30, 2021

maropu mentioned this pull request May 14, 2021

[SPARK-34819][SQL] MapType supports comparable semantics #32552

Closed

HyukjinKwon mentioned this pull request Aug 9, 2021

[SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive #33679

Closed

github-actions bot added the Stale label Aug 22, 2021

github-actions bot closed this Aug 23, 2021
